The discussion below is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
With the explosive growth of the Internet, it has become possible to obtain information on just about any topic. Although queries provided to search engines may take any number of forms, one form that occurs frequently is the “definitional question.” A definitional question is a question of a type such as, but not limited to, “What is X?” or “Who is Y?”. Statistics from 2,516 Frequently Asked Questions (FAQ) extracted from Internet FAQ Archives (http://www.faqs.org/faqs/) show that around 23.6% are definitional questions, which indicates the importance of this type of question.
A definitional question answering (QA) system attempts to provide relatively long answers to such questions. Stated another way, the answer to a definitional question is not a single named entity, quantity, etc., but rather a list of information nuggets. A typical definitional QA system extracts definitional sentences that contain the most descriptive information about the search term from a document or documents and summarizes the sentences into definitions.
Many QA systems utilize statistical ranking methods based on obtaining a centroid vector (profile). In particular, for a given question, a vector consisting of the terms that co-occur most frequently with the question target is formed as the question profile. Candidate answers extracted from a given large corpus are then ranked based on their similarity to the question profile. The similarity is normally a TF-IDF score in which both the candidate answer and the question profile are treated as bags of words in the framework of the Vector Space Model (VSM).
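By way of illustration only, the centroid-based ranking described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the implementation of any particular QA system: the function names (tokenize, build_profile, rank_candidates, and so on), the whitespace tokenizer, the smoothed IDF formula, the use of cosine similarity over TF-IDF weights, and the toy corpus are all hypothetical choices made for the example. Stop words are deliberately not filtered, so common words may enter the profile.

```python
import math
from collections import Counter

PUNCT = ".,?!\"'"

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]

def build_profile(target, sentences, top_k=5):
    """Question profile: the terms that co-occur most often with the target."""
    target_terms = set(tokenize(target))
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        if target_terms <= set(tokens):  # sentence mentions the target
            counts.update(t for t in tokens if t not in target_terms)
    return [term for term, _ in counts.most_common(top_k)]

def idf_table(sentences):
    """Smoothed inverse document frequency, one 'document' per sentence."""
    n = len(sentences)
    df = Counter()
    for sentence in sentences:
        df.update(set(tokenize(sentence)))
    return {t: math.log(n / d) + 1.0 for t, d in df.items()}

def tfidf_vector(tokens, idf):
    """Bag-of-words vector weighted by TF-IDF; unseen terms get weight 0."""
    tf = Counter(tokens)
    return {t: c * idf.get(t, 0.0) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidates(target, corpus, candidates, top_k=5):
    """Score each candidate against the question profile, best first."""
    idf = idf_table(corpus)
    profile_vec = tfidf_vector(build_profile(target, corpus, top_k), idf)
    scored = [(cosine(tfidf_vector(tokenize(c), idf), profile_vec), c)
              for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy corpus, invented for illustration only.
corpus = [
    "Tiger Woods is a golfer born in 1975",
    "Tiger Woods won the Masters golfer",
    "The weather is sunny today",
]
ranked = rank_candidates("Tiger Woods", corpus,
                         ["golfer born in 1975", "sunny weather today"])
```

In this sketch, a candidate that shares terms with the profile receives a positive score, while a candidate that shares none receives a score of zero. Note that both the profile and the candidates are reduced to unordered term weights before comparison.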
VSM is based on an independence assumption. Specifically, VSM assumes that the terms in a vector are statistically independent of one another. However, the terms in an answer or nugget come from a sentence, in which words are commonly not independent. For example, if the definitional question is “Who is Tiger Woods?”, a candidate answer may include the words “born” and “1975”, which are not independent; in particular, the sentence may include the phrase “... born in 1975 ...”. The existing VSM framework does not accommodate such term dependence.
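The effect of the independence assumption can be illustrated with a short Python sketch. The helper names bag_of_words and adjacent_pairs and the example sentences are hypothetical and are introduced only for this illustration; the adjacent-pair view is merely one simple way to expose the ordering information that a bag of words discards, and is not presented as the approach of any particular system.

```python
from collections import Counter

def bag_of_words(text):
    """Unordered term counts, as used in a VSM-style representation."""
    return Counter(text.lower().split())

def adjacent_pairs(text):
    """Ordered pairs of neighboring words, which retain local dependence."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

sentence = "Woods was born in 1975"
scrambled = "1975 in born was Woods"

# The two word sequences are indistinguishable as bags of words ...
same_bag = bag_of_words(sentence) == bag_of_words(scrambled)

# ... but differ once the relationship between neighboring words is kept.
same_pairs = adjacent_pairs(sentence) == adjacent_pairs(scrambled)
```

Here same_bag evaluates to True and same_pairs to False: the bag-of-words representation cannot tell that “born” and “1975” are linked by the phrase “born in 1975” in one sequence and not in the other.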